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Abstract 

Background: Exposure to atmospheric particulate matter (PM) remains an important public health concern, 
although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial 
resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved 
exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or 
county-specific ambient concentrations. 

Methods: We developed generalized additive mixed models that describe regional and small-scale spatial and 
temporal gradients (and corresponding uncertainties) in monthly mass concentrations of fine (PM 2 . 5 ), inhalable 
(PM 10 ), and coarse mode particle mass (PM 2 . 5 _io) for the conterminous United States (U.S.). These models expand 
our previously developed models for the Northeastern and Midwestern U.S. by virtue of their larger spatial domain, 
their inclusion of an additional 5 years of PM data to develop predictions through 2007, and their use of refined 
geographic covariates for population density and point-source PM emissions. Covariate selection and model 
validation were performed using 10-fold cross-validation (CV). 

Results: The PM 25 models had high predictive accuracy (CV R 2 =0.77 for both 1988-1998 and 1999-2007). While 
model performance remained strong, the predictive ability of models for PM 10 (CV R 2 =0.58 for both 1988-1998 and 
1999-2007) and PM Z5 _ 10 (CV R 2 =0.46 and 0.52 for 1988-1998 and 1999-2007, respectively) was somewhat lower. 
Regional variation was found in the effects of geographic and meteorological covariates. Models generally 
performed well in both urban and rural areas and across seasons, though predictive performance varied somewhat 
by region (CV R 2 =0.81, 0.81, 0.83, 0.72, 0.69, 0.50, and 0.60 for the Northeast, Midwest, Southeast, Southcentral, 
Southwest, Northwest, and Central Plains regions, respectively, for PM 2 . 5 from 1999-2007). 

Conclusions: Our models provide estimates of monthly-average outdoor concentrations of PM 25 , PM 10 , and PM 25 _ 10 
with high spatial resolution and low bias. Thus, these models are suitable for estimating chronic exposures of 
populations living in the conterminous U.S. from 1988 to 2007. 

Keywords: Particulate matter, Spatio-temporal models, Land use regression, Spatial smoothing, Penalized splines, 
Generalized additive mixed model 
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Background 

Understanding the health impacts resulting from exposure 
to atmospheric particulate matter (PM) air pollution re- 
mains a priority for environmental public health. The phys- 
ical and chemical characteristics of PM affect its relevance 
to human health, as demonstrated by the observed differ- 
ences in behavior, composition, and health impacts for fine 
(PM<2.5 um in aerodynamic diameter: PM 25 ) and coarse 
(2.5<=PM<10 um in aerodynamic diameter: PM 2 . 5 _i 0 ) par- 
ticles [1-3]. These differences make it important to exam- 
ine the health effects of PM using exposure assessment 
methods able to capture variation in the levels of each PM 
size fraction across the spatial and temporal scales relevant 
to health outcomes, especially when studies are conducted 
over large geographic areas. Traditionally, however, epi- 
demiologic studies of the chronic health effects of PM 
air pollution have used crude methods to assess particu- 
late exposures, estimating subject's chronic exposure ei- 
ther by imputing ambient concentrations from the 
nearest monitor or by using area-wide averages [2], thus 
ignoring within-city spatial gradients in air pollutant 
levels and restricting these studies to areas with nearby 
monitoring data. 

To avoid these limitations, more sophisticated methods 
to assess long-term air pollution exposures have been re- 
cently developed that provide location-specific {e.g., at a 
residence) information on exposure and that can be ap- 
plied to large populations living across large geographic 
areas [4-16]. Many of these studies [4-8,11-15] used 
location-specific geographic characteristics such as popu- 
lation density or the proximity of roadways to describe 
small-scale spatial variations in air pollutant levels (i.e., 
land use regression (LUR)). Others have used spatial mod- 
eling of long-term averages or time-period-specific levels 
alone [9,10] or in combination with LUR [6,16]. Addition- 
ally, spatio-temporal modeling methods have also been de- 
veloped which include LUR covariates [14,15] to model air 
pollutant levels at unmeasured locations. In one such ap- 
plication, McMillan et al. [17] incorporate the output of a 
deterministic Eulerian atmospheric chemistry and trans- 
port simulation model (the U.S. Environmental Protection 
Agency's Community Multi-scale Air Quality model) in a 
spatio-temporal model using Bayesian fitting methods. 
Also, spatio-temporal models have included observations 
of satellite-based aerosol optical depth (AOD) [18-24] to 
predict PM concentrations over both small [18,19] and 
large spatial domains [23,24], with mixed results. 

In our previous work, we developed and validated spatio- 
temporal generalized additive mixed models (GAMMs) of 
outdoor PM 25 and PMi 0 levels for the Northeastern and 
Midwestern U.S. that included geographic information 
system (GlS)-based time-invariant spatial covariates and 
time-varying covariates such as meteorological data 
[11-13]. We showed that PM 2 . 5 , PM 10 , and PM 2 5 _ 10 levels 



were estimated with a high degree of accuracy (predicted 
values did not display bias, on average, in comparison with 
measured values) and precision (predicted values were 
strongly correlated with observed values) and that the 
models were able to account for both within- and between- 
city variation in PM concentrations. When our models 
were used to assess chronic PM exposures in epidemio- 
logical studies, we found higher health risks than when sim- 
pler exposure assessment approaches were used [11,25,26], 
likely due to the models' ability to reduce exposure error by 
estimation of within-city variability in PM levels, specifically 
in traffic-related PM. 

Our previous GIS-based spatio-temporal exposure 
models used for the Nurses' Health Study [11-13] were de- 
veloped only for the Northeastern and Midwestern U.S. 
and for 1988-2002. In the present analysis, we expand the 
modeling domain to the conterminous U.S. and include 
PM 2 . 5 and PM 10 monitoring data through 2007. We dem- 
onstrate the predictive accuracy of a computationally effi- 
cient but flexible spatio-temporal modeling approach, 
applied to the conterminous U.S., which combines spatial 
smoothing and regionally-varying non-linear smooth func- 
tions of time-varying and time-invariant geographic and 
meteorological covariate effects. Also, we evaluate the po- 
tential for improved model prediction resulting from the 
use of geographic covariates with higher spatial resolution 
than those used previously for traffic density, population 
density, and point-source emissions density. 

Methods 

We developed three separate GIS-based spatio-temporal 
models of PM levels: 1) PM 2 5 from 1999-2007, 2) PM 2 . 5 
from 1988-1998, and 3) PM 10 from 1988-2007. As with 
our previous models for the Northeastern and Midwestern 
U.S. [11-13], these models used measured PM concentra- 
tions, monitoring site locations, GIS-based location- 
specific characteristics and location- and month-specific 
meteorological data, and spatial smoothing of monthly 
and long-term average levels to describe large- and small- 
scale spatial variability and temporal variability in PM 25 
and PM 10 levels over time. 

Air pollution, geographic, and meteorological data 
Air pollution data 

Monthly mean PM 2 5 and PM 10 values were calculated 
from available monitoring data using the same methods as 
for our previous models [12,13]. Briefly, PM 25 and PMi 0 
measurement data from 1988-2007 were obtained from 
the U.S. Environmental Protection Agency's Air Quality 
System (AQS) network, from the Interagency Monitoring 
of Protected Visual Environments (IMPROVE), Clean Air 
Status and Trends (CASTNet), Stacked Filter Unit (SFU), 
Southeastern Aerosol and Visibility Study (SEAVS), Meas- 
urement of Haze and Visual Effects (MOHAVE), and 
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Pacific Northwest Regional Visibility Experiment Using 
Natural Tracers (PREVENT) networks by accessing the 
Visibility Information Exchange Web System [27], from 
three Harvard-based research studies: the "Five Cities" 
study [28], the "24 Cities" study [29], and the "Six 
Cities" study [30], and from the Southern Aerosol Re- 
search and Characterization Study (SEARCH) network 
[31]; summary statistics on the monitoring data can be 
found in Additional file 1: Table SI. Monthly means were 
calculated by first averaging 24-hr mean values at each 
monitoring site, and then averaging the daily (with the ex- 
ception of CASTNet which provided 2-week means) site 
means within the calendar month, provided that greater 
than approximately 70% of the nominal days had valid 
PM 2 . 5 or PM 10 values. The AQS contributed the bulk of 
the monthly means and sites (94 and 91%, respectively, for 
the 1999-2007 PM 2 . 5 model; 89 and 86%, respectively, for 
the 1988-1998 PM 2 5 model; and 93 and 89%, respectively, 
for the PM 10 model). 

Geographic data 

Characteristics of the PM monitoring sites were quanti- 
fied using a GIS (ArcMap 10.1, Environmental Systems 
Research Institute (ESRI), Redlands, CA). We considered 
only geographic data available (i.e., non-missing) over 
the conterminous U.S. to facilitate generating model pre- 
dictions at any location within this domain. The Albers 
Equal Area Conic U.S. Geological Survey (USGS) projec- 
tion was used for all geographic data. 

We estimated traffic density using data from the U.S. 
Bureau of Transportation Statistics 2005 National 
Highway Planning Network (NHPN) [32] using a ker- 
nel density function (ESRI Spatial Analyst) evaluated on a 
30 m cell size raster. The kernel density approach involves 
deriving locally varying values by applying weights from a 
quadratic kernel within a specified neighborhood [33]. 
The neighborhood for this function was specified at 
100 m based on data from previous studies of near-road 
pollutant decay [34,35]. Distance to nearest road values 
were also generated for each monitoring site for U.S. 
Census Feature Class Code (CFCC) road classes Al 
(primary roads, typically interstates, with limited ac- 
cess), A2 (primary major, non-interstate roads), A3 
(smaller, secondary roads, usually with more than one 
lane in either direction), and A4 (roads used for local 
traffic usually with one lane in either direction) roads 
using ESRI StreetMap Pro 2007 road network data. Dis- 
tance to road values were truncated at 500 m; as a result 
this term represented only micro- to middle-scale local 
variability in PM levels near roadways. 

The proportion of residential (low-intensity and high- 
intensity) and urban (low-intensity and high-intensity 
residential, and industrial/commercial/transportation) 
land use was calculated for each location using 



neighborhoods of 1 and 4 km, using data from the U.S. 
Geological Survey (USGS) 1992 National Land Cover 
Dataset [36]. Tract-level population density data de- 
rived from the 1990 U.S. Census were obtained [37] and 
converted to a 500 m cell-size raster, based on the location 
of the center of each cell. Density values at each cell 
were averaged with the values at four adjacent cells, one 
in each cardinal direction, to reduce spatial discontinu- 
ities across cells. County-level population density data 
from the 1990 U.S. Census were obtained from ESRI 
Data & Maps and were spatially smoothed from county 
geographic centroids to prediction locations using a 
generalized additive model (GAM) with spatial bivariate 
thin-plate penalized splines [38]. 

We estimated the density of point-source emissions 
of PM 2 .5 and PMi 0 using kernel density functions 
(ESRI Spatial Analyst) with neighborhoods of 3, 7.5, 
and 15 km and data from the U.S. EPAs 2002 National 
Emissions Inventory [39]. In our earlier work, 1 and 
10 km buffers were used [11-13]. Larger neighbor- 
hoods were chosen for this analysis to reflect more dis- 
tant sources; however, values at greater distances were 
down-weighted due to use of the kernel density func- 
tion. Also, neighborhoods with<=50% overlap were 
chosen, to minimize collinearity. 

Elevation data were obtained in raster format from the 
USGS's National Elevation Dataset [40] (with a native 
resolution of ~ 30 m) and averaged using a moving win- 
dow with a radius of 300 m. 

The traffic density within 100 m, distance to nearest 
road, tract- and county-level population density, and 
point-source emissions density covariates were natural-log 
transformed, after the addition of a constant, to obtain a 
more uniform distribution and thereby improve stability 
in the estimation of the penalized spline smooth functions, 
using the formula: 

Z ut = ln(ZV(lO-min(ZV)) (1) 

where Z,. ( is the transformed covariate, Z* i t is the covar- 
iate on the native scale, and the constant 10 was chosen 
to reduce the leverage of values near zero. Similarly, the 
elevation covariate was transformed using a square root 
transformation after adding a constant to ensure a mini- 
mum value of one. 

Meteorological data 

Monthly average wind speed, temperature, and total 
precipitation measurements were obtained from the 
National Climatic Data Center (NCDC) and spatially 
smoothed using separate GAMs, as specified below, for 
each month and for each of seven regions of the conter- 
minous U.S. (Figure 1), with region boundaries based 
loosely on the U.S. Census Regions [41]. Monthly 
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predictions of the meteorological parameters at the pre- 
diction locations (monitoring sites, grid points, or geo- 
coded subject residences) were then made using the 
fitted models. The form of these models was: 

Ju = a t. r + g t ,M) + ar, e*,HV(o, 0 2 t,r) (2) 

where y lit represents the measured values for a given me- 
teorological parameter at z'= 1... I r sites in each of seven 
geographical regions indexed by r (Northeast, Midwest, 
Southeast, Southcentral, Southwest, Northwest, and 
Central Plains; Figure 1) and t=l...T monthly time pe- 
riods {T =240 for 1988-2007), and s, is the projected 
spatial coordinate pair for the ith location. g t , r {si) accounts 
for residual monthly spatial variability within the region, 
specified as spatial bivariate thin-plate penalized spline 
terms with basis dimension k t , r = I t , r * 0.9. The value of 0.9 
was chosen such that the basis dimension was as large as 
practicable which allowed the data to determine the com- 
plexity of the fitted functions, but was essentially arbitrary. 
To reduce the potential for over- fitting, a multiplier of 1.4 
(using the gamma argument to gam()), as recommended 
by Wood [39], p. 195, was used. Additionally, data on the 
percentage of stagnant air days per month from the 
NCDC's Air Stagnation Index [42] were obtained, natural- 
log transformed, and spatially smoothed using GAMs 
(Equation 2) for each month and region. 



Statistical models 

The 1999-2007 PM 2 . 5 model 

The generic form of the 1999-2007 PM 2 5 model was: 

y it = a + a tJ + ^d q {X Lq ) + jypA Z i,t.p) + St.M) 

q p 

+ g(s t ) + b t + e Lt : b r N(0, a 2 b ); e iit ~N(0, a 2 e r , t ) 

(3) 

where y itt is the natural-log transformed monthly average 
PM 2 .5 for i= l...I r sites in each of the seven geographic 
regions indexed by r (Figure 1) and / sites in total and 
t =1...T monthly time periods (7=T08 for PM 2 . 5 from 
1999-2007), and s, is the projected spatial coordinate 
pair for the ith location. X iq are GIS -based time- 
invariant spatial covariates for q = l...Q, Z iitiP are time- 
varying covariates for p = l...P, and a t>r is a monthly 
intercept that represents the mean across all sites within 
the region. d q and f p , r are one-dimensional penalized 
spline smooth functions for Q GIS -based time -invariant 
spatial covariates and P time-varying meteorological 
covariates, respectively, each with a basis dimension of 
10. gt, r {si) accounts for residual monthly spatial vari- 
ability within the region, and g(s,) for time-invariant 
spatial variability across the conterminous U.S., with 
both terms specified as spatial bivariate thin-plate penal- 
ized splines with basis dimension values: k Ur = l Ur * 0.9 
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and k={I- Q) * 0.9, respectively. The site-specific random 
effect bi represents unexplained site-specific variability; 
thus our characterization of the model as a GAMM. 

We used a two-stage modeling approach to fit the above 
model (Equation 3). In the first stage (Equation 4), we esti- 
mated site-specific intercepts («;) adjusting for time- 
varying covariates and residual monthly spatial variability 
separately for each of the seven geographic regions. This 
allowed the effects of time-varying covariates to vary 
among the regions, and assumed that the residual monthly 
spatial terms were stationary and isotropic only within the 
region rather than across the entire conterminous U.S. Fit- 
ting the first stage regionally, rather than for the entire 
conterminous U.S. at once, also reduced the computa- 
tional burden of model fitting, necessary due to the large 
number of monthly observations (120,618 for PM 25 from 
1999-2007). Data from areas of adjacent states within 
about 400 km of each region were included in the regional 
first-stage models to minimize potential boundary effects. 

The first stage model equation was: 

y it = Ui + a t/ + ^ f p . r ( z ' ip) +StA s i) + e "'-t; e u ~N(0, <x 2 e r , ( ) 
p 

(4) 

and was fit iteratively for each region in a back-fitting ar- 
rangement [43,11-13] with m + a t , r + T"] f PJ . (Zjj.p) es- 

p 

timated jointly and g Ur {Si) estimated separately by month, 
such that variability in the measured concentrations is 
parsed between the covariates and the residual spatial 
terms. For the spatial models in the first stage, a multi- 
plier of 1.4 (using the gamma argument to gam()) was 
used to avoid over-fitting [39], p. 195, except in the 
Northwest region, where, due to limited data, a value of 
1.8 was used. All individual fits within the back-fitting 
were done using the gam() function in the mgcv package 
[44] of R [45]. 

In the second stage, we fit a spatial model to the u t vec- 
tor of values obtained from the regional first-stage models 
using GIS-based time-invariant spatial covariates and re- 
sidual time-invariant spatial variability. To do this, we 
combined the regional data sets from the first stage, after 
eliminating the overlapping data from the 400 km regional 
buffers. Thus the second stage (Equation 5) was fit to data 
from the entire conterminous U.S. {i.e., all seven regions), 
and was: 

u. = a + Y^dq {Xi, q ) + g(si) + be b r N(0, a 2 b ) 
i 

(5) 

where u- is an estimated site-specific intercept that rep- 
resents the adjusted long-term mean at each location; 
the other terms are as above. The second stage was also 



fit using the gam() function in the mgcv package of R. 
Because we included data from the entire conterminous 
U.S. in the second stage, we investigated the extent to 
which the time-invariant covariate effects varied by re- 
gion of the country. We did this by including interaction 
terms by region, adding: a r + ~^jd r ^ (X ^ q * M;) separ- 

r 

ately to the model for each covariate q, where a r is a cat- 
egorical variable for the main effect of region, and M, is 
a zero/one indicator for whether location i is in a given 
region or not. We also explored regional interactions of 
covariate effects that varied smoothly in space using ten- 
sor products of penalized smoothing spline bases [39]. 

The 1988-1998 PM 2S model 

As in our previous work [13], the generic form of the 
1988-1998 PM 25 model was: 

y u = « + Y. d i( Xi «) +Y.fp-^ Z '^) + *(*) 

q p 

+ gist) + bi + e u ; b r N(0, a 2 b ); e u ~N(0, a 2 el ) 

(6) 

where the terms are as above except that the response 
variable is the natural-log transformed ratio of monthly 
average PM 25 to model predicted PMi 0 . Note that data 
from 1988-2007 were used for model fitting. Thus T = 
240 even though this model was used to predict PM 25 
levels for only the 132 months from 1988-1998. The 
model was fit to 130,594 observations; 419 value were de- 
leted as outliers where monthly average PM 25 was greater 
than 1.5 times predicted PMi 0 . Also, to account for non- 
linearity in the ratio as predicted PM 10 levels increase, pre- 
dicted PM 10 (from Equation 3) was included in the model 
as an additional time-varying covariate Z iAp . Finally, gs eas ,r 
(s,) accounts for residual seasonal spatial variability within 
the region for each of four seasons (winter, spring, sum- 
mer, autumn), rather than for each month. The model 
was fit using a two-stage approach, as for the 1999- 
2007 PM 2 . 5 model above. 

The 1988-2007 PM W model 

The generic form as well as model fitting of the 1988- 
2007 PM 10 model was the same as for the 1999- 
2007 PM 2 5 model, except that the response variable, y iit , 
was the natural-log transformed monthly average PM 10 
and T=240, with 280,060 monthly observations from 
1988-2007. The model was fit using a two-stage ap- 
proach, as for the 1999-2007 PM 2 5 model above. 

Model predictions 

We obtained model predictions from each model by gen- 
erating the covariates at locations of interest (either moni- 
toring locations for model evaluation or grid locations for 
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display purposes) for each month, and then transforming 
to the native scale by exponentiation. To avoid ex- 
trapolation, covariates at prediction locations beyond 
their range among the monitoring locations were set 
to the appropriate minimum or maximum among the 
monitoring locations (doing so within each region for 
time-varying covariates). For the 1988-1998 PM 25 
model, exponentiation yields the predicted PM 25 :PMi 0 
ratio (which was truncated to a maximum value of one, 
affecting only 0.8% of the data), which was multiplied by 

predicted PMi 0 to obtain predicted PM 2 . 5 (pM 2 S i t = exp 

exp(y P M loi ,t)) for 1988-1998. We calcu- 
lated PM 2 .5_io levels at unmeasured locations and months 
by subtracting predicted PM 2 s from predicted PM 10 

(^2.5-10 i,t = PM W i,t~ PM 2.5 i,t> notation as above). 

We generated estimates of uncertainty in model pre- 
dictions (i.e., standard errors) from the 1999-2007 PM 2 5 
and 1988-2007 PM 10 models on the natural-log scale 
using methods described previously [11,12]. For these 
models, 95% prediction intervals on the natural-log scale 
were calculated and exponentiated to assess prediction 
interval coverage. To generate standard errors for the 
1988-1998 PM 2 5 model, we propagated errors in the 
predicted PM 2 5 :PM 10 ratio and predicted PM 10 levels on 
the native scale (see Additional file 1 for details). We 
also propagated errors in the PM 25 _ 10 predictions on 
the native scale using standard methods, assuming inde- 
pendence among the PM 2 5 and PM 10 errors. For 1988- 
1988 PM 2 5 model predictions as well as for PM 25 _ 10 
predictions, prediction interval coverage was assessed 
using 95% prediction intervals based on these native- 
scale standard errors. 



Model validation 

We used 10-fold out-of-sample cross-validation (CV) to 
evaluate model predictive accuracy and thereby inform 
covariate selection. For the 1999-2007 PM 2 5 and 1988- 
2007 PMio models, monitoring sites were selected at 
random and assigned exclusively to one of 10 sets. Be- 
cause few PM 2 5 data were available prior to 1999, we 
used data from the year 2000 for CV of the 1988- 
1998 PM 2 . 5 model. To do this, we first identified a sub- 
set of sites that reported at least 10 monthly PM 2 5 
values in 2000 and at least 70 monthly PM 2 5 values 
across 1988-2007. We then randomly selected from 
among these sites data not to be used for CV (ensuring 
reasonable spatial coverage within each region by ma- 
nipulating the random seed), with the goal of making 
the data for 2000 similar to that in years prior to 1999 
for the purpose of model fitting. We subsequently di- 
vided the remaining monitoring sites that reported data 



in 2000 at random and assigned each site exclusively to 
one of 10 sets. Since the covariate selection process in- 
volved fitting multiple candidate models to the same data, 
set 10 was reserved (i.e., not used for model fitting) to as- 
sess whether the covariate selection process contributed 
to over-fitting. Data from sets one to nine (each set con- 
tains approximately 10% of the data for the 1999- 
2007 PM 2 . 5 and 1988-2007 PM 10 models) were removed 
from the data set sequentially, with the model fit to the 
remaining data and model predictions generated at the lo- 
cations and months of the left-out observations. 

The predictive accuracy of each PM model was deter- 
mined from the squared Pearson correlation between the 
monthly left-out observations and model predictions (CV 
R 2 ), with both on the native rather than the natural-log 
scale. Spatial CV R values were calculated similarly but 
on the long-term means (i.e., one mean per site) of the 
monthly values. Prediction errors were calculated by sub- 
tracting left-out observations from the model predictions. 
Bias in model predictions was determined using the nor- 
malized mean bias factor (NMBF) [Shaocai Yu, personal 
communication] and the slope from major-axis linear 
regression [46] of the natural-log transformed left-out ob- 
servations against the natural-log transformed model pre- 
dictions. The precision of model predictions was obtained 
by taking the mean of the absolute value of the prediction 
errors (CVMAE) and using the normalized mean error fac- 
tor (NMEF) [Shaocai Yu, personal communication]. For- 
mulas for the NMBF and NMEF are provided in Additional 
file 1. Bias and precision values from CV were evaluated 
overall, and by region of the country, urban land use, sea- 
son, monitoring network, and monitoring objective. 

Model development and covariate selection 

For each model, we first fit a 'base' model using the follow- 
ing covariates based on our earlier work [11-13]: distance 
to nearest road for U.S. CFCC road classes A1-A3, 
smoothed county-level population density, urban land use 
within 1 km, elevation, point-source emissions density 
within 7.5 km (of PM 25 emissions for the PM 25 models 
and PMio emissions for the PM 10 model), smoothed 
monthly average wind speed, temperature, total precipita- 
tion, and air stagnation. The 1988-2007 PMi 0 model also 
included tract-level population density. To ensure a parsi- 
monious model specification, we then removed each time- 
varying term to evaluate its contribution and kept in the 
model only those that improved predictive accuracy (using 
the 'base' set of covariates in the second-stage model). 
Using the remaining time-varying covariates, we then 
added or substituted GIS-based time-invariant spatial co- 
variates into the second stage of the model, selecting the 
model with the highest spatial CV R 2 , after removing 
those not statistically significant (p>0.05) per the result of 
Wald tests [38]. As in prior work, only those covariates 
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expected a priori to have a positive or negative physical in- 
fluence on PM levels were considered for inclusion. For 
example, increasing wind speed (a proxy for the amount 
of vertical mixing in the atmosphere) was expected to re- 
sult in decreased PM 25 concentrations due to dilution of 
pollutant emissions. Non-linearity in covariate effects was 
accounted for using penalized spline terms, with the 'sp' 
argument to gam() used to limit each penalized spline 
terms to at most six degrees of freedom (df). Similarly, we 
used the 'sp' argument to gam() to evaluate reducing the 
df of the spatial term in the second stage of the 1999- 
2007 PM 2 . S model. 

Results 

Spatial patterns in model predicted PM 2 . 5 , PM 10 , and 
PM 2 .5_io concentrations 

The maps in Figures 2, 3 and 4 show the spatial distribu- 
tion of long-term average model predicted PM 2 5 , PM 10 , 
and PM 2 . 5 _io levels across the contiguous U.S. (see 
Additional file 2 for an atlas of monthly PM 2 5 levels 
from January 1988 to December 2007). Summary statis- 
tics of measured and predicted levels for each of the PM 
size fractions are presented in Additional file 1: Table SI 
overall, by region, and by network for the 1999-2007 
and 1988-1998 time periods. Generally, PM 2 5 levels were 
highest in southern California, and were elevated across 
the eastern as compared to western U.S. PM 10 and 
PM 2 . 5 _io levels were also highest across the Southwest and 
Central Plains regions (presumably due to greater contri- 
butions from windblown dust than in other areas), and 
were generally more spatially variable than PM 25 . Areas of 
higher elevation had generally lower predicted PM 2 5 and 
PM 10 levels. Increases in model predicted PM levels in 
areas with higher urban land use are also evident, espe- 
cially for PM 2 5 . 



At the spatial resolution (6 km) shown in Figures 2, 3 
and 4, it is not possible to discern the micro- and middle- 
scale impacts of the distance to road covariates, though 
they are evident in Figures 5, 6 and 7, which display model 
predicted PM 2 5 , PM10, and PM 2 5 -io concentrations on a 
30 m grid in a selected area of New York City, New York 
for August 2006. Of note, sharp gradients in tract-level 
population density in this area together with the decreas- 
ing smooth function for this covariate result in several 
somewhat abrupt changes in predicted PMi 0 and therefore 
also predicted PM 2 5 _i 0 . 

Maps of the mean standard errors of monthly PM 2 5 , 
PM 10 , and PM 25 _i 0 model predictions for the contermin- 
ous U.S. are shown in Additional file 1: Figures S4-S6. 
Though the spatial patterns in the mean of the standard er- 
rors (with higher values corresponding to greater average 
uncertainty in monthly model predictions) for each PM size 
fraction are similar to the corresponding spatial pattern in 
mean model predictions, standard errors from the 1999- 
2007 PM 2 5 model are comparatively higher than model 
predictions in the Central Plains region (in eastern Kansas, 
for example), and in northwestern Nevada. Also of note, 
the magnitude of the standard errors from the 1988- 
1998 PM 2 S model is generally greater than that from the 
1999-2007 PM 2 5 model, reflecting uncertainty related to 
the estimation of the PM 2 5 :PMi 0 ratio and, separately, of 
PM 10 levels. A map of the mean predicted PM 2 5 :PM 10 ratio 
across 1988-1998 is presented in Figure 8. The estimated 
ratio is generally higher in the eastern as compared to the 
western U.S., though areas of the Northwest region are also 
higher as compared to the rest of the western U.S. 

CV results 

Results from CV for 1999-2007 for PM 2 . 5 , PM 10 , and 
PM 2 5 _ 10 are presented in Table 1; for 1988-1998 they 
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Figure 3 Means of monthly predicted PM 10 concentrations 
from 1988-2007 on a 6 km grid over the conterminous U.S. 

(5 th to 95 th percentiles shown). 

* 



are presented in Table 2. Overall and when stratified by 
region, spatial CV R 2 values were higher than those for 
corresponding monthly values for PM 2 .5, PMi 0 , and 
PM2.5-10 across 1999-2007 and 1988-1998 (Tables 1 
and 2, respectively). CV statistics by season, tertiles of 
urban land use, monitoring network, and monitoring ob- 
jective are presented for each of the two time periods 
above in Additional file 1: Table S2 for PM 2 . 5 , PM 10 , and 
PM 2 .5_io For both time periods, predictive accuracy was 
generally consistent across tertiles of urban land use, mon- 
itoring network, and monitoring objective for PM 25 , 
PM10, and PM 2 5 _i 0 . Also model predictive performance 
was consistent across seasons, though generally slightly 
lower in the winter season as compared to other seasons. 
Density scatter plots of monthly measured vs. model 



predicted PM 25 , PM10, and PM 25 _i 0 levels from CV are 
shown in Additional file 1: Figure S2. 

CV results for 1999-2007 

PM2.5 Across the conterminous U.S., predictive accuracy 
of the 1999-2007 PM 2 . 5 model was high (CV R 2 =0.77) 
at the monthly average level, though lower in the North- 
west at 0.50. Across regions, model predictions exhibited 
low bias and high precision (NMBF of -1.6% and NMEF 
of 14.3%, respectively), but were less precise in the west 
(Southwest, Northwest, and Central Plains regions). 
Standard errors in monthly PM 25 model predictions 
were reasonably well-scaled (prediction interval coverage 
of 0.98). The model predicted long-term spatial trends 
very well (spatial CV R 2 =0.89). 

PMjo Across the conterminous U.S., predictive accuracy 
for PM 10 monthly model predictions was moderate (CV 
R 2 =0.58) for 1999-2007, though lower in the Southcen- 
tral, Northwest, and Central Plains regions (>0.45). 
Across regions, we found low bias in model predictions 
but only moderate precision (NMBF of -5.1% and 
NMEF of 24.4% across regions, respectively). Standard 
errors were reasonably well-scaled for the PM 10 model 
(prediction interval coverage (across 1988-2007) of 
0.97). The model predicted long-term spatial trends well 
(spatial CV R 2 =0.69). 

PM 2 .5_io Across the conterminous U.S., predictive ac- 
curacy for PM 25 _io was moderate (CV R =0.52). Across 
regions, we found low bias but poorer precision than for 
PM 2 . S or PM 10 (NMBF of -3.2% and NMEF of 38.9%, re- 
spectively). In the Southcentral region, bias in PM 2 5 _i 0 
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Figure 5 Predicted PM 2 . 5 concentrations (from the 1999-2007 
model) on a 30 m grid in a selected area of New York City, 
New York for August 2006 showing local spatial variability 

(5 th to 95 th percentiles shown). 

■ ■ 



monthly values was slightly larger and negative (NMBF 
of -11.4%); in the Northwest region it was also larger 
but positive (NMBF of 18.1%). 

CV results for 1988-1998 

PM 2 .5 Across the conterminous U.S., predictive accuracy 
for PM 25 monthly model predictions was again high 
(CV R 2 = 0.77), though a gain lower in the Northwest re- 
gion at 0.56. Across regions, model predictions exhibited 
low bias and high precision (NMBF of -0.8% and NMEF 
of 14.8%, respectively), but were again less precise in the 
west (Southwest, Northwest, and Central Plains 




^ Low : 4.1 

Figure 7 Predicted PM 2 5 -io concentrations on a 30 m grid in a 
selected area of New York City, New York for August 2006 
showing local spatial variability (5 th to 95 th percentiles shown). 

v ! > 



regions). The prediction interval coverage for the 1988- 
1998 PM 2 5 model of 0.99 indicates that the standard er- 
rors are slightly inflated, likely due to the use of the 
delta method to approximate standard errors on the 
native scale prior to propagation of the uncertainty 

when multiplying the estimated PM 2 5 :PM 10 ratio ^exp 
{y ratio i^j) by predicted PM 10 (exp(y PMio j>t j). 

PM 10 Across the conterminous U.S., predictive accuracy 
for PM 10 monthly model predictions was again moderate 
(CV R 2 =0.58) for 1988-2007, though again lower in the 
Southcentral, Northwest, and Central Plains regions for 
PM 10 (>0.50). Across regions, model prediction exhibited 
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low bias but only moderate precision (NMBF of -3.3% 
and NMEF of 21.8% across regions, respectively). 

PM 2 .5_io Predictive accuracy was moderate for PM 25 _i 0 
(CV R =0.46 across regions). Across regions, model pre- 
dictions again exhibited low bias but precision was poorer 
than for PM 25 or PM 10 during the same time period 
(NMBF of -4.5% and NMEF of 42.5%, respectively). In the 
Northeast region, bias in PM 25 _ 10 monthly values was 
slighdy larger and negative (NMBF of -14.6%), whereas in 
the Northwest region it was also larger but positive 
(NMBF of 27.0%). Predictive accuracy was also substan- 
tially lower in the Southeast region (CV R 2 =0.12). Interest- 
ingly, this decrease in predictive accuracy does not appear 
to be related to the lower levels of measured PM 2 5 _ 10 in 
the Southeast region; by contrast the levels in the 
Northwest region are comparable (Additional file 1: 
Table SI) but predictive accuracy in this region was not 
markedly reduced (CV R 2 =0.54). 

Model covariate effects 
1999-2007 PM 2 . 5 model covariates 

Several GIS -based time-invariant spatial covariates were 
found to be important predictors in the 1999-2007 PM 25 
model, including: elevation, urbanized land use within 
1 km, county-level population density, distance to nearest 
Al, A2, and A3 roads, and point-source emissions density 
within 7.5 km. 

We found significant interactions by region in the ef- 
fects of two GIS-based time-invariant spatial covariates: 
urban land use within 1 km and elevation. 

For urban land use within 1 km, regional effects in the 
Midwest, Southeast, Northwest, and Central Plains regions 
were significantly different from the remaining regions. The 
estimated smooth functions for this covariate, from the 
1999-2007 PM 2 s model, showed that it was generally asso- 
ciated with increasing PM 25 (after adjusting for other 
model covariates), with the pattern varying slightly by re- 
gion (Additional file 1: Figure SI panel A5). 

For elevation, regional effects in the Southwest, 
Northwest, and Central Plains regions were significantly 
different from the remaining regions. Increasing eleva- 
tion was generally associated with decreasing PM 25 , 
with the effects varying substantially by region, espe- 
cially in the Northwest region (Additional file 1: Figure 
SI panel A2). Though not visible in Figures 2, 3, 4, 5 
and 6, regional covariate effects resulted in small spatial 
discontinuities at regional boundaries in monthly pre- 
diction surfaces. 

Surprisingly, traffic density within 100 m performed 
slightly worse than distance to road covariates (A1-A3). 
This may have resulted from poorer spatial accuracy of 
the network of roads used by the NHPN as compared to 
the ESRI StreetMap Pro 2007 road network. Distance to 



the nearest A4 road did not increase predictive accuracy 
and was removed from the 1999-2007 PM 2 s model. 

Increasing county-level population density was positively 
associated with measured PM 2 5 levels, as was increasing 
point-source emissions density within 7.5 km (Additional 
file 1: Figure SI panels A6 and A7, respectively). 

As expected due to dilution and wet deposition pro- 
cesses, respectively, increasing levels of wind speed and 
total precipitation had consistent negative effects on 
PM 2 5 levels in each of the seven regions (with the ex- 
ception of wind speed in the Midwest). The effect of 
temperature on PM 2 5 levels differed slightly by region 
(Additional file 1: Figure SI panel Al), although PM 25 
levels generally decreased with increasing temperature. 
We hypothesize that this counterintuitive result may be 
due to cold temperatures acting as a proxy for local 
wood smoke emissions and less mixing in the atmos- 
phere. In contrast, during warm seasons, higher PM 2 5 
levels due to increased photochemical production of 
secondary aerosol result in a less spatial variability in 
PM 2 s which is better captured by the monthly intercept 
and monthly spatial smooth terms in non-winter sea- 
sons as compared to in winter. We also note that the 
moderate correlation between temperature and air stag- 
nation (Pearson's r = 0.69) may interfere with direct in- 
terpretation of the effect of temperature alone. Air 
stagnation was found to improve predictive accuracy in 
only the Midwest and Southeast regions, with increas- 
ing stagnation associated with increasing PM 25 levels 
(Additional file 1: Figure SI panel All), though in other 
regions, especially the southwest, it was inversely asso- 
ciated with PM 2 . 5 levels. 

The second-stage spatial term g(sj) exhibited substantial 
complexity in the 1999-2007 PM 2 5 model, using 501.6 df. 
In contrast, the monthly spatial terms g t , r (si) used fewer df 
(median across region and months of 22.7). 

1988-1998 PM 2 .s model covariates 

For the 1988-1998 PM 2 . 5 model, only predicted PMi 0 
and elevation remained in the model as spatial covari- 
ates. However, the same four meteorologic covariates as 
for the 1999-2007 PM 2 5 model were included in this 
model. Their effects were similar, except for that of total 
precipitation where the ratio increases and then de- 
creases, reflecting the complexity of differential wet de- 
position processes for fine and coarse mode particles. 
We found a significant interaction by region in the effect 
of elevation, with the effect in the Northwest region sig- 
nificantly different from that in the remaining regions 
(Additional file 1: Figure SI panel B2). 

The second stage spatial term g(si) exhibited substan- 
tial complexity, using 503.3 df; the seasonal spatial terms 
gSeasA s i) use d fewer df (median of 174.1 across regions 



Yanosky et al. Environmental Health 2014, 13:63 
http://www.ehjournal.net/content/1 3/1 /63 



Page 11 of 15 



Table 1 Bias and precision statistics from cross-validation (CV) of PM 2 . 5 , PM 10 , and PM 2 . 5 -io models from 1999-2007 



Pollutant Region* 



Monthly values 



Spatial 







N B 


N excluded 11 


Model R 2D 


CV R 2 


Intercept 


Slope E 


NMBF (%) F 


CVMAE F 


NMEF (%) F 


CV R 


PM 2 . 5 


All 


108,718 


4 


0.84 


0.77 


0.3 


0.87 


-1.6 


1.61 


14.3 


0.89 




Northeast 


24,318 


0 


0.85 


0.81 


0.2 


0.92 


-1.4 


1.44 


11.4 


0.88 




Midwest 


15,767 


0 


0.85 


0.81 


0.2 


0.91 


-0.7 


1.31 


10.6 


0.89 




Southeast 


24,201 


1 


0.88 


0.83 


0.2 


0.92 


-0.4 


1.31 


9.7 


0.82 




Southcentral 


1 2,762 


0 


0.79 


0.72 


0.2 


0.89 


-0.6 


1.44 


14.1 


0.83 




Southwest 


1 3,448 


2 


0.79 


0.69 


0.4 


0.81 


-5.5 


2.65 


26.8 


0.83 




Northwest 


9,052 


0 


0.65 


0.50 


0.7 


0.62 


-4.6 


2.07 


28.9 


0.62 




Central Plains 


9,170 


1 


0.72 


0.60 


0.4 


0.81 


-2.8 


1.66 


23.2 


0.81 


PM, 0 


All 


1 04,509 


22 


0.71 


0.58 


0.7 


0.77 


-5.1 


5.21 


24.4 


0.69 




Northeast 


1 6,982 


0 


0.67 


0.57 


0.7 


0.76 


-4.7 


4.17 


19.8 


0.68 




Midwest 


10,088 


0 


0.63 


0.48 


0.9 


0.71 


-6.0 


4.82 


21.2 


0.56 




Southeast 


20,316 


0 


0.62 


0.49 


0.7 


0.76 


-4.0 


3.89 


17.6 


0.46 




Southcentral 


8,092 


0 


0.61 


0.45 


0.8 


0.74 


-6.0 


6.24 


27.4 


0.44 




Southwest 


24,050 


19 


0.76 


0.62 


0.7 


0.79 


-4.7 


6.92 


27.8 


0.72 




Northwest 


5,943 


1 


0.59 


0.49 


0.8 


0.71 


-1.6 


5.33 


30.2 


0.72 




Central Plains 


1 9,038 


2 


0.61 


0.50 


0.8 


0.71 


-7.6 


5.11 


31.3 


0.66 




All 


41,098 


1,936 


0.67 


0.52 


0.6 


0.76 


-3.2 


4.18 


38.9 


0.61 




Northeast 


8,375 


423 


0.49 


0.35 


0.9 


0.58 


-8.9 


3.46 


42.7 


0.53 




Midwest 


4,567 


233 


0.61 


0.43 


0.7 


0.70 


-1.5 


3.72 


34.4 


0.49 




Southeast 


7,178 


359 


0.45 


0.28 


0.5 


0.75 


-4.2 


3.02 


38.0 


0.36 




Southcentral 


3,614 


23 


0.61 


0.40 


0.9 


0.62 


-11.4 


5.63 


44.6 


0.33 




Southwest 


9,237 


296 


0.74 


0.56 


0.5 


0.81 


-1.6 


5.64 


36.6 


0.64 




Northwest 


2,579 


340 


0.55 


0.47 


0.3 


0.92 


18.1 


3.97 


48.2 


0.58 




Central Plains 


5,548 


262 


0.56 


0.41 


0.6 


0.74 


-2.3 


3.87 


39.8 


0.61 



Corresponds to regions shown in Figure 1. 

includes data from CV sets one through nine; see text for details. 

c Three PM 25 values above 70 ug/m 3 (>99.99th percentile) and one low value, as well as 22 PM 10 values above 150 ug/m 3 (>99.99th percentile} were excluded 
from CV statistics as outliers. Extreme values may have been due to local events such as wildland or other fires, dust storms, etc. 

Calculated on the native rather than natural-log scale and among observations used for CV for comparison to the CV R 2 . For PM 2 5 _ 10 , predicted levels<=0 
were removed. 

E From major axis regression of predictions on measurements (both are natural-log transformed monthly means); see text for details. 

F NMBF is normalized mean bias factor; CVMAE is cross-validation mean absolute error; NMEF is normalized mean error factor; see text for details. 

G Spatial CV R 2 calculated at 1,245, 1,192, and 512 sites with >35 valid monthly-average measurements for PM 25 , PM 10 , and PM 2 .5_io, respectively. 

Calculated as the difference between monthly PM 10 and PM 2 5 measurements and, separately, monthly PM 10 and PM 2 5 model predictions. Of the 1,936 values 

excluded as outliers, 1 1 were removed due to extreme PM 10 or PM 25 measurements; an additional 1,925 were due to measured or predicted PM 25 _ 10 below the 

limit of detection of 0.57 ug/m 3 (<3.4th percentile of measured and <1.6th of predicted PM 25 _ 10 ). 



and seasons), indicating greater residual spatial variabil- 
ity in the seasonal (natural-logged) PM 25 PM 10 ratio 
than in the monthly spatial terms from the 1999- 
2007 PM 2 . 5 or 1988-2007 PM 10 models. 

1988-2007 PM 10 model covariates 

The 1988-2007 PM 10 model included the same set of me- 
teorological and GIS -based time-invariant spatial covariates 
as the 1999-2007 PM 2 5 model, except that in addition it 
included tract-level population density. The effects of these 



covariates were similar to those in the 1999-2007 PM 2 .s 
model, except as discussed below. 

For the 1988-2007 PM 10 model, we found significant 
regional interactions only for urban land use within 1 km, 
with effects in the Northeast, Northwest, and Central 
Plains regions different from that in the remaining regions. 
The estimated smooth functions for this covariate showed 
that it was generally associated with increasing PM 10 (after 
adjusting for other model covariates), with the pattern 
varying slightly by region (Additional file 1: Figure SI 
panel C4). 
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Table 2 Bias and precision statistics from cross-validation (CV) of PM 2 . 5 , PM 10 , and PM 2 . 5 _io models from 1988-1998 



Pollutant Region 



Monthly values 



Spatial 





N B 


N excluded*" 


Model R 2D 


CV R 2 


Intercept 


Slope E 


NMBF (%) F 


CVMAE F 


NMEF (%) F 


CV R 


All 


1 0,823 


0 


0.82 


0.77 


0.2 


0.92 


-0.8 


1.81 


14.8 


0.88 


Northeast 


2,455 


0 


0.77 


0.72 


0.2 


0.93 


-0.5 


1.73 


13.0 


0.85 


Midwest 


1,564 


0 


0.74 


0.70 


0.3 


0.88 


-1.3 


1.64 


12.2 


0.78 


Southeast 


2,385 


0 


0.76 


0.73 


0.1 


0.97 


-1.0 


1.74 


11.7 


0.78 


Southcentral 


1,446 


0 


0.74 


0.67 


0.4 


0.86 


0.2 


1.67 


15.6 


0.77 


Southwest 


1,205 


0 


0.84 


0.77 


0.2 


0.93 


1.3 


2.49 


22.8 


0.89 


Northwest 


809 


0 


0.75 


0.56 


0.4 


0.81 


-6.1 


2.12 


26.8 


0.60 


Central Plains 


959 


0 


0.77 


0.67 


0.1 


0.95 


-1.2 


1.55 


20.2 


0.81 


All 


145,398 


12 


0.71 


0.58 


0.5 


0.82 


-3.3 


5.44 


21.8 


0.66 


Northeast 


35,593 


5 


0.66 


0.57 


0.5 


0.83 


-2.5 


4.71 


18.2 


0.57 


Midwest 


16,276 


0 


0.65 


0.51 


0.7 


0.78 


-2.9 


5.57 


20.4 


0.55 


Southeast 


26,882 


0 


0.72 


0.61 


0.5 


0.84 


-1.7 


4.04 


15.6 


0.57 


Southcentral 


1 2,668 


0 


0.66 


0.50 


0.9 


0.72 


0.1 


5.20 


21.2 


0.50 


Southwest 


23,586 


5 


0.76 


0.60 


0.5 


0.84 


-5.1 


7.52 


27.4 


0.68 


Northwest 


8,874 


2 


0.63 


0.52 


0.9 


0.72 


-4.2 


6.86 


26.0 


0.66 


Central Plains 


21,069 


0 


0.64 


0.50 


0.6 


0.76 


-7.4 


5.57 


31.7 


0.66 


All 


4,032 


205 


0.61 


0.45 


0.7 


0.70 


-4.7 


4.73 


42.6 


0.56 


Northeast 


802 


48 


0.52 


0.32 


0.9 


0.56 


-14.6 


4.28 


46.5 


0.37 


Midwest 


378 


21 


0.64 


0.47 


0.8 


0.66 


-4.4 


4.28 


36.9 


0.44 


Southeast 


771 


58 


0.35 


0.12 


1.1 


0.51 


1.7 


3.66 


43.3 


0.09 


Southcentral 


453 


2 


0.58 


043 


0.8 


0.68 


-5.1 


5.35 


38.2 


0.38 


Southwest 


835 


34 


0.70 


0.53 


0.4 


0.81 


-8.4 


6.15 


44.4 


0.70 


Northwest 


271 


15 


0.60 


0.54 


0.6 


0.85 


27.0 


4.14 


47.5 


0.63 


Central Plains 


522 


27 


0.42 


0.32 


0.7 


0.69 


-4.8 


4.81 


45.9 


0.42 



PM 10 



PM 7 



Corresponds to regions shown in Figure 1. 

includes data from CV sets one through nine; see text for details. 

c 12 PM 10 values above 150 ug/m 3 (>99.99th percentile) were excluded from CV statistics as outliers. Extreme values may have been due to local events such as 
wildland or other fires, dust storms, etc. 

Calculated on the native rather than natural-log scale and among observations used for CV (for only the year 2000 for PM 25 and PM 25 _ q0 ) for comparison to the 
CV R 2 . For PM 2 .5_io, predicted levels<=0 were removed. 

E From major axis regression of predictions on measurements (both are natural-log transformed monthly means); see text for details. 

F NMBF is normalized mean bias factor; CVMAE is cross-validation mean absolute error; NMEF is normalized mean error factor; see text for details. 

G Spatial CV R 2 calculated at 1,031 and 422 sites with >3 valid monthly-average measurements for PM2.5 and PM 2-5 _ 10 , respectively, and at 1,502 sites with >35 valid 

monthly-average measurements for PM 10 . 

H Measured and predicted levels (rather than the natural-log of the PM 2 .5 to PM 10 ratio) were compared. 

Calculated as the difference between monthly PM 10 and PM 2 .5 measurements and, separately, monthly PM 10 and PM2.5 model predictions. Of the 207 values 
excluded as outliers, 8 were removed due to extreme PM 10 or PM 2 .5 measurements; an additional 197 were due to measured or predicted PM 25 _ 10 below the limit 
of detection of 0.57 ug/m 3 (<3.6th percentile of measured and <1 .5th of predicted PM 2 5 _ 10 ). 



Tract-level population density was negatively associated 
with measured PMi 0 levels (Additional file 1: Figure SI 
panel C8). 

The second-stage spatial term g(si) exhibited substantial 
complexity in the 1988-2007 PMi 0 model, using 882.6 df. 
In contrast, the monthly spatial terms g f , r (s;) used fewer df 
(median across regions and months of 20.1). 

Modeling assumptions 

Our modeling approach assumes stationary and isotropic 
spatial variation, that covariate effects are additive, and 
that model residuals are independent and normally 



distributed, with mean zero and constant variance. We 
evaluated the assumption of stationarity in the second 
stage spatial term in alternative second stage models that 
allowed the smoothing parameter to vary across the do- 
main (adaptive bases), including those that allowed station- 
arity to vary by urbanness, but these did not substantially 
change model fit nor increase predictive accuracy. We also 
evaluated whether the effects of the GIS -based time- 
invariant spatial covariates (other than urban land use) var- 
ied with urbanness by stratifying by tertiles of urban land 
use within 1 km; we found no evidence of differential co- 
variate effects by urbanness. Finally, we evaluated temporal 
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autocorrrelation in model residuals; the resulting plots are 
provided in Additional file 1: Figure S3. Though the plot 
for the 1988-1998 PM 25 model residuals shows only lim- 
ited evidence of autocorrelation, plots for the other two 
models show evidence of modest autocorrelation at a lag of 
one month and, to a lesser extent, seasonal dependence (at 
lags ~12 months) not accounted for in the modeling. 

Discussion 

Our modeling approach provides predictions of monthly 
outdoor PM 2 .5, PM 10 , and PM 25 _i 0 levels at any location 
within the conterminous U.S. with high spatial and tem- 
poral {i.e., monthly) resolution over a 20-year period 
(1988-2007). Model performance was particularly strong 
for PM 2 . 5 , with a CV R 2 of 0.77 for both 1988-1988 and 
1999-2007 time periods. Although lower, model per- 
formance for PMio and PM 25 _i 0 was reasonable (CV 
R 2 =0.58 and 0.52, respectively). The strong model per- 
formance can be attributed to the fact that our models in- 
corporate regionally-varying spatial and spatio-temporal 
covariate effects and account for residual spatio-temporal 
interaction using regional time-varying (monthly for the 
1999-2007 PM 2 . 5 and 1988-2007 PM 10 models and sea- 
sonal for the 1988-1998 PM 25 model) spatial smooth 
terms in combination with spatially smooth terms of the 
long-term mean. This approach gives our models the abil- 
ity to account for micro (<100 m) , middle (100-500 m), 
neighborhood (500 m-4 km), and urban (4-50 km)-scale 
spatial gradients as well as larger-scale regional effects that 
vary over time. Further, this approach has the added bene- 
fit of straightforward interpretation of covariate effects on 
predicted PM levels, albeit where not obscured by collin- 
earity or concurvity. Since model predictions can be made 
at a subject's residence or other relevant point location, ra- 
ther than interpolated from a pre-defined grid, our models 
offer high spatial resolution which may reduce exposure 
error when estimating chronic exposures in epidemiologic 
studies, as has been shown in previous analyses [11,25,26]. 
The models have been used to provide PM 2 5 , PM 10 , and 
PM 2 5 _io monthly exposure estimates at subject residences 
in recent epidemiologic analyses [47,48] . 

Of the covariates evaluated for inclusion in the three 
models, several were found to be important predictors in 
each of the three models: wind speed, air temperature, 
total precipitation, air stagnation, and elevation. Also, the 
1999-2007 PM 2 . 5 and 1988-2007 PM 10 models both in- 
cluded county population density, point-source emissions 
density (for the corresponding PM size fraction), distance 
to nearest road for road classes A1-A3, and urban land 
use within 1 km. Also, in the 1999-2007 PM 2 5 and 
1988-1998 PM 2 S models, we found regional variation 
in the effects of elevation, and, in the 1999-2007 PM 2 5 
and 1988-2007 PMi 0 models, of urban land use within 
1 km. The robustness of our findings may be due to our 



covariate selection procedures which were performed 
using the fully specified spatio-temporal model, allow- 
ing for residual spatial trends and changing covariate ef- 
fects, including potential nonlinearity in those effects, 
to compete with each candidate covariate, in contrast to 
approaches where covariate selection is based on mul- 
tiple linear regression before spatial modeling is per- 
formed. Su et al. [49] used a more complicated variable 
selection approach, but one that may lead to over-fitting 
and that is not practical for models with large geo- 
graphic and temporal scopes such as ours, with approxi- 
mately 125,000-250,000 observations and run times for 
one model fit of between 24 and 96 hours. Kloog et al. 
[22] and Sampson et al. [16] have described attractive 
alternative approaches, which allow for the inclusion of 
large numbers of covariates while shrinking their ef- 
fects, but these approaches also increase model com- 
plexity and may thus not be practical for models applied 
to the entire conterminous U.S. that span many years of 
monthly data. 

Spatial trends in long-term (1999-2007) mean PM 25 
levels from our modeling approach, presented in Figure 2, 
are broadly similar to those in a recent spatial analysis of 
annual-average PM 2 5 levels in the year 2000 [16] and to 
those in our earlier work in the Northeastern and 
Midwestern US [11-13]. It is possible that with additional 
covariates, such as satellite-derived AOD measures 
[19-24], model predictive accuracy (i.e., CV R ) may im- 
prove, especially in areas far from monitors [24] . Although 
models have been developed that incorporate satellite- 
derived measures, to date there have been limited compar- 
isons to GIS-based spatio-temporal models. For example, 
Lee et al. [24] used satellite-derived AOD data in combin- 
ation with a low spatial resolution (2° x 2.5°) global 3-D 
chemical transport model (GEOS-Chem) to estimate 
PM 2 5 levels in the conterminous U.S., but compared it to 
a kriging model without geographic or meteorological co- 
variates that could explain small-scale spatial variability. 
Paciorek et al. [18] compared hierarchical spatio-temporal 
models that included geographic and meteorological co- 
variates with satellite-derived AOD vs. those without, but 
only in mid-Atlantic region of the U.S., at the monthly 
time scale, and over one year: 2004. These models had 
high predictive ability, but inclusion of AOD did not im- 
prove predictive accuracy (monthly CV R 2 =0.827 without 
AOD and 0.825 with calibrated Moderate Resolution 
Imaging Spectroradiometer or Geostationary Oper- 
ational Environmental Satellite AOD). More recent 
studies demonstrate the utility of daily as opposed to 
monthly satellite-derived AOD measures in New England 
and the mid- Atlantic states [21,22], reporting yearly CV R 2 
values of 0.83 and 0.81, respectively. However, these 
models cannot be used to predict PM levels before the 
year 2000, given that they require satellite-derived AOD 
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data that are not available before that time period. Given 
the air quality monitoring, meteorological, geographic, and 
other data available from 1988-2007, our modeling ap- 
proach provides a reasonable balance of computational 
feasibility (using standard software) and complexity while 
representing the small- and large-scale spatial, temporal, 
and spatio-temporal features of the data. 

Conclusions 

Our models provide estimates of monthly-average outdoor 
concentrations of PM 25 , PM 10 , and PM 2S _ 10 with high 
spatial resolution and low bias. For PM 25 and PM 10 , the 
models performed well in urban and rural areas and across 
seasons, though performance varied somewhat by region of 
the conterminous U.S. For PM 2 .5_io, model performance 
was poorer, particularly in the Southeast and Southcentral 
regions. Regional variation was found in the effects of geo- 
graphic and meteorological covariates. The models are suit- 
able for estimating chronic PM exposures of populations 
living in the conterminous U.S. from 1988 to 2007. 
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